# Chapter 3: Advanced Manipulation on Time Series

Time series data captures sequential observations recorded at consistent time intervals. For effective analysis and modeling, however, additional preprocessing and specialized operations are often needed. These operations help with stabilizing the mean of a series, smoothing its dynamics, reducing noise, and highlighting underlying trends, to name a few.

These operations matter in practice when forecasting stock prices, analyzing weather patterns, predicting customer demand, or detecting anomalies in sensor data. By mastering them, you can gain valuable insights into temporal patterns, identify meaningful trends, and make informed decisions based on historical data.


**Learning Objectives:**

In this section, you will gain a comprehensive understanding of time series preprocessing techniques including:

1. Stabilizing the mean of a series through differencing operations

2. Smoothing the series or reducing noise through a rolling mean (moving average)

3. Improving signal clarity or isolating cyclic patterns within a series through filtering techniques

4. Reducing computational complexity through downsampling mechanisms

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Components of Time Series

Time series analysis provides a body of techniques to better understand a dataset. Perhaps the
most useful of these is the decomposition of a time series into 4 constituent parts:

* **Level.** The baseline value for the series if it were a straight line.

* **Trend.** The optional and often linear increasing or decreasing behavior of the series over time.

* **Seasonality.** The optional repeating patterns or cycles of behavior over time.

* **Noise.** The optional variability in the observations that cannot be explained by the model.

These constituent components can be thought to combine in some way to provide the
observed time series. Assumptions can be made about these components both in behavior
and in how they are combined, which allows them to be modeled using traditional statistical
methods. These components may also be the most effective way to make predictions about
future values, but not always. In cases where these classical methods do not result in effective
performance, these components may still be useful concepts, and even input to alternate methods.
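
One common assumption is that the components combine additively, i.e. observed value = level + trend + seasonality + noise. The sketch below builds a synthetic hourly series under that assumption; every number in it (level, slope, amplitude, noise scale) is made up for illustration and is not taken from the Arusha data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly index covering 10 days (illustrative only).
idx = pd.date_range("2023-05-01", periods=240, freq="h")
t = np.arange(len(idx))

level = 20.0                                     # baseline value
trend = 0.01 * t                                 # slow linear increase
seasonality = 3.0 * np.sin(2 * np.pi * t / 24)   # 24-hour cycle
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.5, size=len(idx))      # unexplained variability

# Additive model: the observed series is the sum of its components.
series = pd.Series(level + trend + seasonality + noise, index=idx)
print(series.head())
```

A multiplicative model, where the components are multiplied instead of summed, is the other usual assumption; which one fits better depends on the data.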

>#### <font color=#800080> Example: </font> <a class="anchor" id="Task-1"></a>

Arusha is a city in northern Tanzania, widely recognized for its distinctive attributes and its significant contributions to both Tanzania and East Africa. In particular, it is known as Tanzania's safari capital and as a popular stopover for adventurers preparing for a Kilimanjaro expedition. Given the impact of climate change, the Tanzania Tourist Board, which is the national tourism organization, seeks your expertise in examining climate patterns over the past three months, from 1 May 2023 to 31 July 2023. This assessment will inform their promotion of tourism activities while taking weather conditions into account.


1. What is Arusha sadly known for? Hint: ..."Rwanda, August 1993".


The weather data has been provided by the Tanzania Meteorological Authority (TMA). Several variables have been measured on an hourly basis:

    * The Wind Direction at 50 Meters (Degrees)
    * The Wind Speed at 50 Meters (m/s)
    * The Temperature at 2 Meters (Degrees Celsius)
    * The Precipitation Corrected (mm/hour)
    * Specific Humidity at 2 Meters (g/kg)

The data has been shared with you as a CSV file called `arusha_hourlyseries.csv`.

2. Load it and tell us what you observe just from looking at the column headers.

In [None]:
# If you open the data in a spreadsheet, you will see why the skiprows parameter is needed.
arusha_data = pd.read_csv('data/arusha_hourlyseries.csv', skiprows=13)
arusha_data.head()

In [None]:
arusha_data.info()

We notice that the **YEAR**, **MONTH** and **DAY** columns are stored separately. We might want to reassemble them into a single date column with the corresponding data type. Before doing that, let's rename the columns appropriately.

In [None]:
arusha_data.columns = ['year' , 'month' , 'day' , 'hour' ,
                       'WD50M', 'T2M', 'WS50M', 'PRECTOTCORR','QV2M' ]
arusha_data.head(3)

In [None]:
arusha_data['Date'] = pd.to_datetime(arusha_data[["year", "month", "day"]])
arusha_data =  arusha_data[['Date']+ list(arusha_data.columns[:-1])]
arusha_data.head()

The same applies to the **hour** column: here, we convert its integer representation into a time of day.

In [None]:
arusha_data['hour'] = pd.to_datetime(arusha_data['hour'], unit='h').dt.time
arusha_data.head(3)
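
As an alternative to keeping the date and the hour in separate columns, the two can be folded into a single timestamp. The sketch below assumes integer `year`, `month`, `day` and `hour` columns and uses a made-up toy frame (`toy` and `timestamp` are hypothetical names) rather than the Arusha data.

```python
import pandas as pd

# Toy frame mimicking the year/month/day/hour layout (values are made up).
toy = pd.DataFrame({
    "year": [2023, 2023, 2023],
    "month": [5, 5, 5],
    "day": [1, 1, 2],
    "hour": [0, 13, 23],
})

# pd.to_datetime assembles a timestamp from columns named
# year, month, day and, optionally, hour.
toy["timestamp"] = pd.to_datetime(toy[["year", "month", "day", "hour"]])
print(toy["timestamp"])
```

A single timestamp column is convenient as an index for time-based operations such as resampling at sub-daily frequencies.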

In [None]:
arusha_data.info()

Let's look at how each of the variables behaves over time.

In [None]:
import matplotlib.pyplot as plt
colors = ['r' , 'g' , 'b' , 'k' , 'cyan']
for var , col  in zip(arusha_data.columns[5:] ,colors):
    plt.figure(figsize=(15,4))
    plt.plot(arusha_data[var] ,  color=col)
    plt.title(var)

To get a feel for the series, we might also want to check its start date and end date.

In [None]:
arusha_data["Date"].min()


In [None]:
arusha_data["Date"].max()

We can see that the data collection started on **1 May 2023** and ended on **1 August 2023** at midnight.

### Differencing time series


Differencing is a method of transforming a time series dataset. It can be used to remove the series' dependence on time, the so-called temporal dependence, which includes structures like trends and seasonality. Taking the difference between consecutive observations is called a lag-1 difference. The lag can be adjusted to suit the specific temporal structure. You can use the `.diff()` method to perform this operation.

Since the observations are hourly, let's say we are interested in the temperature difference between two consecutive hours within the series.

In [None]:
arusha_data['T2M'].head()

In [None]:
arusha_data["Temp_diff"] = arusha_data["T2M"].diff()
arusha_data[['Date' , 'T2M', 'Temp_diff']].head(10)

As you can see, the first entry of the `Temp_diff` column is `NaN`: it corresponds to the first observation, so there is no earlier value to subtract.

Differencing becomes useful when we want to remove a linear trend from the dynamics of the series. Let's see what the temperature difference looks like.

In [None]:
plt.figure(figsize=(15,4))
plt.plot(arusha_data['Temp_diff'])

One can also change the lag in order to address a specific temporal structure. By passing an integer to the `.diff()` method, we compute the difference between observations that are that many steps apart. For instance, the temperature difference between observations two hours apart is computed as follows:

In [None]:
arusha_data["T2M"].diff(2)
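
To make the effect of the lag concrete, here is a minimal sketch on a synthetic series (not the Arusha data) made of a linear trend plus a 24-step cycle. The slope, amplitude and length are arbitrary assumptions. A lag-1 difference turns the pure linear trend into a constant, and a lag-24 difference also cancels the daily cycle.

```python
import numpy as np
import pandas as pd

t = np.arange(72)                                # 72 hypothetical hourly steps
trend = 0.5 * t                                  # linear trend with slope 0.5
daily_cycle = 2.0 * np.sin(2 * np.pi * t / 24)   # pattern repeating every 24 steps
toy = pd.Series(trend + daily_cycle)

lag1 = pd.Series(trend).diff()   # lag-1 difference of the pure trend
lag24 = toy.diff(24)             # difference against the same hour one day earlier

print(lag1.dropna().unique())
print(lag24.dropna().round(6).unique())
```

Note that a lag of `k` leaves the first `k` entries as `NaN`, since those observations have no counterpart `k` steps earlier.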

### Cumulating  time series

As opposed to differencing, we might be interested in tracking the accumulated total of a variable over time, in order to understand growth patterns, compare year-to-date figures, or observe the accumulation rate of a particular metric.

For the case at hand, it would not make much sense to compute the accumulated temperature over hours or days, as temperature is an intensive physical property and is not additive. However, the accumulated precipitation over some time period tells us about the pluviometry of the location, that is, the total amount of rainfall it received within a specific time frame.


In [None]:
arusha_data["Prec_cum"] = arusha_data["PRECTOTCORR"].cumsum()
# Only the last expression of a cell is displayed, so show the relevant columns directly.
arusha_data[['Date', 'PRECTOTCORR', 'Prec_cum']].head(10)

In [None]:
plt.figure(figsize=(15,4))
plt.plot(arusha_data['Prec_cum'])
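
Accumulating and differencing are inverse operations, which the following sketch illustrates on a few made-up precipitation values (the numbers and variable names are hypothetical).

```python
import pandas as pd

# Made-up hourly precipitation in mm/hour.
rain = pd.Series([0.0, 0.2, 1.5, 0.0, 0.3])

running_total = rain.cumsum()               # total rainfall so far at each hour
recovered = running_total.diff()            # differencing undoes the accumulation
recovered.iloc[0] = running_total.iloc[0]   # the first value has no predecessor

print(running_total.tolist())
print(recovered.tolist())
```

The last value of the running total is the total rainfall over the whole period, and the differenced series matches the original hourly amounts up to floating-point rounding.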

In [None]:
arusha_data[arusha_data['month'] == 5].tail()

This can in turn answer questions about the monthly total amount of rainfall in Arusha, and we can also observe how it evolves.

In [None]:
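# Sum the hourly precipitation within each month; taking the last value of
# Prec_cum would instead give the running total since the start of the series.
monthly_Tot_rfall = arusha_data.groupby('month')['PRECTOTCORR'].sum()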
monthly_Tot_rfall

In [None]:
plt.figure(figsize=(8,5))
plt.plot(['May' , 'June', 'July' , 'August'] , monthly_Tot_rfall ,'g')
plt.ylabel('precipitation')
plt.xlabel('month')
# plt.title('Monthly Total amount of rainfall')

### Rolling Mean

While the daily or hourly temperature fluctuations can be quite erratic due to various transient factors, we might be interested in identifying any longer-term trends or anomalies that might suggest broader climatic shifts. The rolling mean (often referred to as the moving average) is a powerful tool to tackle that issue.

By applying a rolling mean with a fixed window size, we can smooth out the hour-to-hour fluctuations and see longer-term patterns more clearly.

We can use the `.rolling()` method, whose `window` parameter sets the number of values to consider in the rolling window. In the example below, we take the mean of six values, which gives the moving average over every six consecutive hours.

In [None]:
arusha_data['rolling_6h_mean'] = arusha_data['T2M'].rolling( window= 6).mean()
arusha_data[['Date' , 'T2M' , 'rolling_6h_mean']]
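
By default, the rolling mean is `NaN` until a full window of values is available, and the window trails behind each observation. The sketch below, on made-up temperatures, shows how the `min_periods` and `center` parameters of `.rolling()` change that behaviour; the window size of 3 is an arbitrary choice for readability.

```python
import pandas as pd

temps = pd.Series([20.0, 22.0, 21.0, 25.0, 24.0, 23.0])  # made-up values

trailing = temps.rolling(window=3).mean()                # NaN until 3 values are available
partial = temps.rolling(window=3, min_periods=1).mean()  # averages whatever is available
centered = temps.rolling(window=3, center=True).mean()   # window centred on each point

print(pd.DataFrame({"T": temps, "trailing": trailing,
                    "partial": partial, "centered": centered}))
```

A centred window keeps the smoothed curve aligned with the original one, whereas a trailing window lags behind it; the trailing form is the one that only uses past values.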

### Resampling the Time Series

There are several scenarios where one might need to resample the time series, which is a fundamental step in time series analysis. Resampling time series data refers to the process of changing the time-frequency or granularity of the data points. This can be either **an increase in frequency (upsampling)** or **a reduction (downsampling)**. By adapting the frequency, analysts can align datasets with different intervals for consistent comparisons or analyses.

### Downsampling the Time Series

Downsampling the series, which involves reducing the data's granularity, is particularly useful in improving computational efficiency, reducing noise and providing a higher-level view of patterns or trends.

1. Improving Computation Efficiency:

For very large datasets, computation can become resource-intensive and time-consuming. Downsampling can make data processing and modeling more manageable.

2. Noise Reduction:

High-frequency data can sometimes introduce a lot of noise. Downsampling, when combined with aggregation (like taking the mean), can help smooth the data and remove short-term fluctuations or noise, revealing longer-term trends or cycles.

3. Visualizations:

Too many data points can make visualizations cluttered and less informative. Downsampling can make plots and charts clearer and easier to understand.


A practical example would be to convert hourly data to daily data, daily data to monthly data, or to any frequency dictated by the problem. For instance, let's look at how the wind speed evolves from day to day.

In [None]:
plt.figure(figsize=(25, 10))
for day   in range(12):
        ax =  plt.subplot(4,3, day+1)
        plt.plot(arusha_data['WS50M'].values[24*day:24*(day+1)])
        #plt.xlabel('hour')
        plt.ylabel('wind speed')
        #plt.title('day ' +str(day+1))
        

Despite the wind's overall daily fluctuations, there may not be a significant change within two consecutive hours. This suggests that if we encounter one of the aforementioned situations, we could downsample the wind speed signal by recording it every two hours. Consequently, there would be twelve data points in a day instead of twenty-four.

This is achieved as follows:

In [None]:
arusha_per2h = arusha_data[::2]
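
Slicing with `[::2]` keeps every other row and discards the rest. When the data has a datetime index, `.resample()` offers an aggregating alternative. The sketch below contrasts the two on made-up wind speeds; the timestamps and values are hypothetical.

```python
import pandas as pd

idx = pd.date_range("2023-05-01", periods=6, freq="h")  # hypothetical timestamps
wind = pd.Series([3.0, 5.0, 4.0, 6.0, 2.0, 8.0], index=idx)

every_other = wind[::2]                     # keeps hours 0, 2, 4 and drops the others
two_hour_mean = wind.resample("2h").mean()  # averages each pair of consecutive hours

print(every_other)
print(two_hour_mean)
```

Both results have three points, but the slice ignores half of the measurements while the mean uses all of them, so the two can differ noticeably when the signal varies quickly.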


Below are the resulting daily plots of wind speed.

In [None]:
plt.figure(figsize=(25, 10))
for day   in range(12):
        ax =  plt.subplot(4,3, day+1)
        plt.plot(arusha_per2h['WS50M'].values[12*day:12*(day+1)])
        #plt.xlabel('hour')
        plt.ylabel('wind speed')

Another downsampling strategy involves reducing the frequency of data points, by computing an aggregate of a certain time interval. For instance, in case of low variation of the signals, we might consider only working with daily averages, which means that the 24 data points available within a day are therefore represented by their average value. To work that out effectively with dataframes, we would have to set the date column as dataframe index.

In [None]:
arusha_indexed= arusha_data.set_index(["Date"])
arusha_indexed.head(5)

And then, we can take the daily mean of the observations and see the corresponding plot

In [None]:
# Select several columns with a list (double brackets); a bare tuple of names is rejected by recent pandas versions.
arusha_dsp1 = arusha_indexed.resample("D")[['T2M', 'WS50M']].mean()
arusha_dsp1.head(10)

Below we have the daily average of temperature over the course of a month.

In [None]:
import matplotlib.ticker  # binds `matplotlib` and makes sure the ticker submodule is loaded
days =  ['day' + str(i+1) for i in range(30)]
plt.figure(figsize=(15,5))
plt.plot(days, arusha_dsp1['T2M'].head(30))
plt.xlabel('day')
plt.ylabel('temperature')
locator = matplotlib.ticker.MultipleLocator(4)
plt.gca().xaxis.set_major_locator(locator)
formatter = matplotlib.ticker.StrMethodFormatter("{x:.0f}")
plt.gca().xaxis.set_major_formatter(formatter)

Below we have the daily sum of precipitation over the course of a month.

In [None]:
# Select several columns with a list (double brackets); a bare tuple of names is rejected by recent pandas versions.
arusha_dsp2 = arusha_indexed.resample("D")[['PRECTOTCORR', 'QV2M']].sum()
arusha_dsp2
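
Different variables often call for different aggregates: an average for temperature, a sum for precipitation. As a sketch, both can be computed in one pass by giving `.agg()` a dictionary that maps column names to functions. The toy frame below reuses the column names `T2M` and `PRECTOTCORR`, but its values are made up.

```python
import pandas as pd

idx = pd.date_range("2023-05-01", periods=48, freq="h")  # two hypothetical days
toy = pd.DataFrame({
    "T2M": [20.0] * 24 + [22.0] * 24,          # made-up temperatures
    "PRECTOTCORR": [0.1] * 24 + [0.0] * 24,    # made-up precipitation
}, index=idx)

# One aggregate per column: daily mean temperature, daily total precipitation.
daily = toy.resample("D").agg({"T2M": "mean", "PRECTOTCORR": "sum"})
print(daily)
```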

In [None]:
days =  ['day' + str(i+1) for i in range(30)]
plt.figure(figsize=(15,5))
plt.plot(days, arusha_dsp2['PRECTOTCORR'].head(30))
plt.xlabel('day')
plt.ylabel('precipitation')
locator = matplotlib.ticker.MultipleLocator(4)
plt.gca().xaxis.set_major_locator(locator)
formatter = matplotlib.ticker.StrMethodFormatter("{x:.0f}")
plt.gca().xaxis.set_major_formatter(formatter)

### Upsampling the Time Series

On the other hand, upsampling increases the granularity, which can be beneficial for filling gaps in data or preparing data for certain models or applications that require a specific frequency. When resampling, the choice of method for filling or aggregating values—such as linear interpolation, mean, or sum—is crucial to ensure meaningful and accurate representation.


Upsampling is when the frequency of samples is increased (e.g., months to days). Again, you can use the `.resample()` method.

Let's take the daily averages of temperature and wind speed that we computed earlier (`arusha_dsp1`) and resample them back to an hourly frequency using linear interpolation.

In [None]:
arusha_dsp1

In [None]:
arusha_intdaily = arusha_dsp1.resample("H").interpolate(method = "linear")
arusha_intdaily

In [None]:
plt.figure(figsize=(25, 10))
for day   in range(12):
        ax =  plt.subplot(4,3, day+1)
        plt.plot(arusha_data['WS50M'].values[24*day:24*(day+1)] ,label = 'true data')
        plt.plot(arusha_intdaily['WS50M'].values[24*day:24*(day+1)],label = 'intpd data')
        plt.ylabel('wind speed')
        plt.legend()

We can see that linear interpolation of the daily averages is not a good representation of our hourly observations. One could try a polynomial interpolation instead and choose a suitable degree for the polynomial.
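
The filling method determines what the new points look like. As a minimal sketch on made-up daily values (not the Arusha data), the example below upsamples to a 6-hour grid and compares linear interpolation with a forward fill; the 6-hour frequency is an arbitrary choice to keep the output short.

```python
import pandas as pd

idx = pd.date_range("2023-05-01", periods=3, freq="D")
daily = pd.Series([20.0, 26.0, 23.0], index=idx)  # made-up daily means

grid_6h = daily.resample("6h")                    # finer grid; new slots start empty
linear = grid_6h.interpolate(method="linear")     # straight lines between known points
held = grid_6h.ffill()                            # repeat the last known value

print(pd.DataFrame({"linear": linear, "ffill": held}))
```

Neither method can recover variation that happened between the known points, such as a daily cycle that was averaged away during downsampling.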

>#### <font color=#800080>Task 4:</font> <a class="anchor" id="Task-1"></a>


**SahelPower Co.** is an energy company located in Moundou, the second largest city in Chad. It relies primarily on wind farms to generate electricity. The company understands that electricity demand is influenced by various factors, including ambient temperature: in hot weather, households tend to consume more energy.

However, its wind power generation platform is run by a subsidiary start-up called SaoWind, which specializes in building wind power plants. SaoWind provides wind-related data, and SahelPower Co. has the technology to convert wind into energy for electricity usage.

Their overall goal is to optimize electricity production, manage demand-supply gaps, and improve overall efficiency by closely analyzing the datasets they have received from both SNE (Societe Nationale d'Electricite du Tchad) and SaoWind (wind speed and direction).

Here is some additional information about the data files they have received:

* SaoWind:  

    1. Variables: Wind speed at 50M  , Wind Direction at 50M
    2. Resolution: hourly observations
    3. Date range: 01 Jan 2004 at 10AM  - 26 Feb 2004  at 9AM
    4. file name:  `saowind_data.csv`
    
* SNE 

    1. Variables: Electricity demand, Ambient Temperature
    2. Resolution: hourly observations
    3. Date range: 01 Jan 2004 at 12AM  - 26 Feb 2004  at 11PM
    4. file name:  `snelec_data.csv`


1. Wind is often considered one of the cleanest forms of renewable energy. What are its pros and cons?

2. You have noticed that data came from two different sheets. Merge them and report on the different steps you used to achieve that.

3. Plot the temperature and the demand separately and comment on the plots.


4. How do the average electricity demand and the ambient temperature vary across different weeks? Comment on your findings.

5.  On days when the ambient temperature was above 25°C, was the electricity demand significantly higher than on days when it was below 25°C?


6. Plot the temperature differences between consecutive Saturdays within the observations.


7. How does the average electricity generation during daytime hours (e.g., 6 am to 6 pm) compare to the average generation during nighttime hours (e.g., 6 pm to 6 am)? 


8. WD50M measures the wind direction in degrees relative to north. However, for wind power calculations, because of the discontinuity around 0 and 360 degrees, it was decided to convert the wind speed and direction into a wind vector (x and y components) using the formulas below:

$w_x = v \cdot \cos(\phi)$

$w_y = v \cdot \sin(\phi)$


where $v$ is the wind velocity and $\phi$ is the wind angle. Compute the x and y components of the wind vector and plot the corresponding graphs.
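
As a hint, here is a minimal sketch of the conversion on three made-up observations. It assumes the angle is given in degrees, as in the `WD50M` column, and converts it to radians first because NumPy's trigonometric functions expect radians. The values and the `w_x` and `w_y` column names are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "WS50M": [5.0, 5.0, 5.0],      # made-up speeds in m/s
    "WD50M": [0.0, 90.0, 360.0],   # made-up directions in degrees
})

phi = np.deg2rad(toy["WD50M"])     # NumPy trig functions expect radians
toy["w_x"] = toy["WS50M"] * np.cos(phi)
toy["w_y"] = toy["WS50M"] * np.sin(phi)

print(toy.round(6))
```

The directions 0 and 360 degrees, which are far apart as numbers, map to the same vector, which is the reason for working with components.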
